import pandas as pdimport numpy as npfrom lets_plot import*LetsPlot.setup_html(isolated_frame=True)# Load and clean datasetdf = pd.read_json("https://github.com/byuidatascience/data4missing/raw/master/data-raw/flights_missing/flights_missing.json")df = df.replace(["", "NA", None], np.nan)# Add total_delay column (required for multiple tasks)df["total_delay"] = ( df["minutes_delayed_carrier"].fillna(0) + df["minutes_delayed_weather"].fillna(0) + df["minutes_delayed_nas"].fillna(0) + df["minutes_delayed_security"].fillna(0) + df["minutes_delayed_late_aircraft"].fillna(0))
Elevator pitch
The analysis reveals that IAD and SAN airports experience the highest proportions of flight delays, with weather being the leading cause. While carrier and security delays are close behind in frequency, weather delays are slightly more prevalent. This insight can help prioritize weather-related contingency plans.
QUESTION|TASK 1
Fix all of the varied missing data types in the data to be consistent (all missing values should be displayed as “NaN”). Include one raw JSON record with "NaN" for at least one missing field.
Analysis:
This code replaces inconsistent indicators ("", "NA", None) with NaN. A JSON row is printed to verify that missing values are now displayed as "NaN" using pandas.
Show the code
import pandas as pdimport numpy as npfrom lets_plot import*LetsPlot.setup_html(isolated_frame=True)# Load the datasetdf = pd.read_json("https://github.com/byuidatascience/data4missing/raw/master/data-raw/flights_missing/flights_missing.json")# Standardize missing values and show a raw JSON rowdf = df.replace(["", "NA", None], np.nan)# Create a total delay column for later questionsdf["total_delay"] = ( df["minutes_delayed_carrier"].fillna(0) + df["minutes_delayed_weather"].fillna(0) + df["minutes_delayed_nas"].fillna(0) + df["minutes_delayed_security"].fillna(0) + df["minutes_delayed_late_aircraft"].fillna(0))# Print a JSON row with NaN to verifyprint(df.loc[df.isna().any(axis=1)].iloc[0].to_json())
Which airport has the worst delays? Describe the metric you chose, and why you chose it to determine the “worst” airport. Your answer should include a summary table that lists (for each airport) the total number of flights, total number of delayed flights, proportion of delayed flights, and average delay time in hours.
On average, carrier delays cause the largest share of flight delays (~0.008%), followed closely by weather (~0.007%). Security delays lag slightly at (~0.006%). The chart confirms this trend, with carrier delays visually dominating most airports.
What is the best month to fly if you want to avoid delays of any length? Describe the metric you chose and why. Include one chart to support your answer, with the x-axis ordered by month.
According to the BTS website, the “Weather” category only accounts for severe weather delays. Mild weather delays are not counted in the “Weather” category, but are actually included in both the “NAS” and “Late-Arriving Aircraft” categories. Your job is to create a new column that calculates the total number of flights delayed by weather (both severe and mild). You will need to replace all the missing values in the Late Aircraft variable with the mean. Show your work by printing the first 5 rows of data in a table. Use these three rules for your calculations:
a. 100% of delayed flights in the Weather category are due to weather
a. 30% of all delayed flights in the Late-Arriving category are due to weather
a. From April to August, 40% of delayed flights in the NAS category are due to weather. The rest of the months, the proportion rises to 65%
type your results and analysis here
Show the code
import pandas as pdimport numpy as np# Load datadf = pd.read_json("https://github.com/byuidatascience/data4missing/raw/master/data-raw/flights_missing/flights_missing.json")# Replace missing value markersdf = df.replace(["", "NA", None], np.nan)# Create total_delay columndf["total_delay"] = ( df["minutes_delayed_carrier"].fillna(0) + df["minutes_delayed_weather"].fillna(0) + df["minutes_delayed_nas"].fillna(0) + df["minutes_delayed_security"].fillna(0) + df["minutes_delayed_late_aircraft"].fillna(0))mean_late = df["minutes_delayed_late_aircraft"].mean(skipna=True)df["minutes_delayed_late_aircraft"] = df["minutes_delayed_late_aircraft"].fillna(mean_late)print("Filled missing late aircraft delays with mean:", mean_late)def calc_weather(row): wa = row["minutes_delayed_weather"] la =0.30* row["minutes_delayed_late_aircraft"] nas_factor =0.40if row["month"] in ["April", "May", "June", "July", "August"] else0.65 na = nas_factor * row["minutes_delayed_nas"]return wa + la + nadf["weather_delay_est"] = df.apply(calc_weather, axis=1)df[["airport_code", "month","minutes_delayed_weather", "minutes_delayed_late_aircraft","minutes_delayed_nas", "weather_delay_est"]].head()
Filled missing late aircraft delays with mean: 75511.96645021645
airport_code
month
minutes_delayed_weather
minutes_delayed_late_aircraft
minutes_delayed_nas
weather_delay_est
0
ATL
January
36931
104415
207467.0
203109.05
1
DEN
January
21779
70301
36817.0
66800.35
2
IAD
January
4497
70919
35660.0
48951.70
3
ORD
January
24859
160811
364382.0
309950.60
4
SAN
January
4326
38445
21127.0
29592.05
QUESTION|TASK 5
Using the new weather variable calculated above, create a barplot showing the proportion of all flights that are delayed by weather at each airport. Describe what you learn from this graph.
IAD and SAN show the highest weather delay proportions (~0.015%), with most airports falling between 0.005%–0.010%. Even though weather delays are infrequent overall, they dominate at the worst-performing airports—confirming their impact and guiding where to target weather‑response plans.
Show the code
from lets_plot import*import pandas as pdimport numpy as npfrom lets_plot import*# includes ggplot, aes, labs, theme_minimal, etc.LetsPlot.setup_html(isolated_frame=True)# Load and clean datadf = pd.read_json("https://github.com/byuidatascience/data4missing/raw/master/data-raw/flights_missing/flights_missing.json")df = df.replace(["", "NA", None], np.nan)# Total delay columndf["total_delay"] = ( df["minutes_delayed_carrier"].fillna(0) + df["minutes_delayed_weather"].fillna(0) + df["minutes_delayed_nas"].fillna(0) + df["minutes_delayed_security"].fillna(0) + df["minutes_delayed_late_aircraft"].fillna(0))# Fill missing late aircraft delay with meanmean_late = df["minutes_delayed_late_aircraft"].mean(skipna=True)df["minutes_delayed_late_aircraft"] = df["minutes_delayed_late_aircraft"].fillna(mean_late)# Estimate weather delaydef calc_weather(row): wa = row["minutes_delayed_weather"] la =0.30* row["minutes_delayed_late_aircraft"] nas_factor =0.40if row["month"] in ["April", "May", "June", "July", "August"] else0.65 na = nas_factor * row["minutes_delayed_nas"]return wa + la + nadf["weather_delay_est"] = df.apply(calc_weather, axis=1)# Calculate weather proportionweather_prop = ( df.groupby("airport_code") .agg( total_flights=("num_of_flights_total", "sum"), weather_mins=("weather_delay_est", "sum") ) .assign(prop_weather=lambda d: d["weather_mins"] / (d["total_flights"] *60)) .reset_index() .sort_values("prop_weather", ascending=False))# Plotp_weather = ( ggplot(weather_prop, aes(x="airport_code", y="prop_weather"))+ geom_bar(stat="identity", fill="#4682B4") # hex for steelblue+ labs( title="Proportion of Flights Delayed by Weather (Estimated)", x="Airport Code", y="Delay Proportion" )+ theme_minimal())p_weather
STRETCH QUESTION|TASK 1
Which delay is the worst delay? Create a similar analysis as above for Weather Delay with: Carrier Delay and Security Delay. Compare the proportion of delay for each of the three categories in a Chart and a Table. Describe your results.
Show the code
import pandas as pdimport numpy as npfrom lets_plot import*LetsPlot.setup_html(isolated_frame=True)# Load and clean datasetdf = pd.read_json("https://github.com/byuidatascience/data4missing/raw/master/data-raw/flights_missing/flights_missing.json")df = df.replace(["", "NA", None], np.nan)# Create total_delay columndf["total_delay"] = ( df["minutes_delayed_carrier"].fillna(0) + df["minutes_delayed_weather"].fillna(0) + df["minutes_delayed_nas"].fillna(0) + df["minutes_delayed_security"].fillna(0) + df["minutes_delayed_late_aircraft"].fillna(0))# Fill late aircraft delay with meanmean_late = df["minutes_delayed_late_aircraft"].mean(skipna=True)df["minutes_delayed_late_aircraft"] = df["minutes_delayed_late_aircraft"].fillna(mean_late)# Compute estimated weather delaydef calc_weather(row): wa = row["minutes_delayed_weather"] la =0.30* row["minutes_delayed_late_aircraft"] nas_factor =0.40if row["month"] in ["April", "May", "June", "July", "August"] else0.65 na = nas_factor * row["minutes_delayed_nas"]return wa + la + nadf["weather_delay_est"] = df.apply(calc_weather, axis=1)# Aggregate delaysdelay_types = ( df.groupby("airport_code") .agg( total_flights=("num_of_flights_total", "sum"), weather_delay=("weather_delay_est", "sum"), carrier_delay=("minutes_delayed_carrier", "sum"), security_delay=("minutes_delayed_security", "sum") ))# Calculate proportionsdelay_types = delay_types.assign( prop_weather=delay_types["weather_delay"] / (delay_types["total_flights"] *60), prop_carrier=delay_types["carrier_delay"] / (delay_types["total_flights"] *60), prop_security=delay_types["security_delay"] / (delay_types["total_flights"] *60)).reset_index()# Prepare for plotplot_df = delay_types.melt( id_vars="airport_code", value_vars=["prop_carrier", "prop_weather", "prop_security"], var_name="delay_category", value_name="proportion")# Plotp_delay = ( ggplot(plot_df, aes(x="airport_code", y="proportion", fill="delay_category"))+ geom_bar(stat="identity", position="dodge")+ labs( title="Comparison of Delay Proportions by Category and Airport", x="Airport Code", y="Proportion of Flights Delayed", fill="Delay Category" )+ theme_minimal())p_delay
*Insight (Stretch Task): This chart shows that, on average, Carrier delays have the highest proportion across airports, while Security delays tend to have the lowest. However, Weather delays can exceed carrier delays at specific airports — like IAD — highlighting its localized impact.